
fix(fetch): detect encoding for non-UTF-8 pages using charset-normalizer #3880

Open
olegsa wants to merge 6 commits into modelcontextprotocol:main from olegsa:fix/fetch-encoding-detection

Conversation


olegsa commented Apr 9, 2026

Summary

  • Add automatic character encoding detection to mcp-server-fetch using charset-normalizer for pages that don't declare a charset in the HTTP Content-Type header
  • Introduce a get_response_text() helper that checks response.charset_encoding first, then falls back to statistical byte analysis via charset-normalizer
  • Fix garbled text when fetching pages served in non-UTF-8 encodings (e.g. windows-1251, windows-1255, windows-1256, euc-kr) that lack a charset declaration in the HTTP header

Motivation

Many websites (especially non-English ones) serve content in legacy encodings like windows-1255 (Hebrew), windows-1251 (Cyrillic), windows-1256 (Arabic), or euc-kr (Korean) without declaring the charset in the HTTP Content-Type header. The current code uses response.text which defaults to UTF-8, producing garbled/mojibake output for these pages.
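The failure mode is easy to reproduce in plain Python:

```python
# "Привет" ("Hello") encoded in a Cyrillic legacy codepage
raw = "Привет".encode("windows-1251")

# Decoding with the wrong (default) encoding mangles the text into
# replacement characters: the mojibake described above.
bad = raw.decode("utf-8", errors="replace")
good = raw.decode("windows-1251")  # round-trips correctly
```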

Changes

  • server.py: Added get_response_text() that uses charset-normalizer for encoding detection when HTTP headers lack charset info. Replaced response.text with get_response_text(response) in fetch_url(). Also moved httpx from local imports to a top-level import.
  • pyproject.toml: Added charset-normalizer>=3.0.0 as an explicit dependency (already a transitive dep via requests).
  • tests/test_server.py: Added TestGetResponseText class with 5 tests covering UTF-8 passthrough, Ukrainian (windows-1251), Hebrew (windows-1255), Arabic (windows-1256), and Korean (euc-kr) encoding detection.

Test plan

  • All existing tests pass
  • New encoding detection tests pass for Ukrainian, Hebrew, Arabic, Korean
  • UTF-8 pages with charset in HTTP header still use the standard path
  • Non-HTML content (JSON, etc.) is unaffected

Made with Cursor

olegsa added 4 commits April 9, 2026 10:32
…tests

Add charset-normalizer as an explicit dependency in pyproject.toml and
add tests verifying correct decoding of non-UTF-8 pages (Ukrainian
windows-1251, Hebrew windows-1255, Arabic windows-1256, Korean euc-kr).

Made-with: Cursor
Member

@olaservo olaservo left a comment


Nice fix — the get_response_text() helper is well-structured with a conservative fallback chain (charset header → charset-normalizer detection → default). Good test coverage across multiple encodings.

One nit: test_korean_euc_kr actually encodes the text as UTF-8, not euc-kr. Consider renaming it to test_utf8_without_charset_header, or changing it to actually encode as euc-kr to match the test name.

Otherwise LGTM.


This review was assisted by Claude Code.

olegsa added 2 commits April 20, 2026 20:10
The test was encoding the Korean text as UTF-8, so it did not exercise
the charset-normalizer detection path. Encode with euc-kr and extend
the sample text so statistical detection has enough bytes to work with.

Addresses review feedback on modelcontextprotocol#3880.

Made-with: Cursor
Author

olegsa commented Apr 20, 2026

Thanks for the review, @olaservo! Good catch on the euc-kr test — you're right, it was silently encoding as UTF-8. Fixed in 37da471: the test now actually calls text.encode("euc-kr") and uses a longer sample so charset-normalizer's statistical detection has enough bytes to work with. This gives us genuine coverage of the detection path for Korean rather than a duplicate of test_utf8_passthrough. All 5 TestGetResponseText tests still pass locally.

@cliffhall cliffhall added bug Something isn't working server-fetch Reference implementation for the Fetch MCP server - src/fetch labels Apr 20, 2026
@cliffhall
Member

@claude review

Contributor

claude Bot commented Apr 21, 2026

Claude finished @cliffhall's task in 1m 49s


PR Review: fix(fetch): detect encoding for non-UTF-8 pages

Tasks:

  • Gather context from PR description and comments
  • Read changed files
  • Review server.py changes
  • Review test coverage
  • Review dependency changes
  • Post final review

This is a solid, well-scoped fix. The fallback chain in get_response_text() is sensible and the previous reviewer's concern about test_korean_euc_kr has been addressed. A few observations follow.

server.py

get_response_text() (lines 29–38) — looks correct

The three-level fallback chain is appropriate:

  1. Declared charset in HTTP header → use httpx's native decoding (response.text)
  2. No declared charset → run charset_normalizer.from_bytes statistical detection
  3. Detection fails → fall back to response.text (httpx applies its own heuristics)

response.charset_encoding returns None only when there is no charset= parameter in the Content-Type header, which is exactly the condition you want. API usage of from_bytes(...).best() and str(result) is correct.

Inconsistency: robots.txt still uses response.text directly (line 106)

```python
# check_may_autonomously_fetch_url — line 106
robot_txt = response.text
```

This is probably fine in practice (robots.txt is defined to be ASCII), but it's a minor inconsistency with the new approach. A short comment explaining why the standard path is acceptable here would add clarity, or you could use get_response_text(response) for uniformity.

TestFetchUrl mocks are unintentionally compatible with the new code

The existing TestFetchUrl tests use MagicMock() responses. MagicMock().charset_encoding auto-creates a truthy MagicMock object (not None), so get_response_text takes the first branch and calls response.text — which the tests have explicitly set. This works, but it's fragile. If charset_encoding were ever set to None in one of those mocks, get_response_text would then call from_bytes(response.content) on another MagicMock, likely producing surprising results. Not a blocker, but worth noting.
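The fragility is easy to see in isolation: an undeclared attribute on a MagicMock is itself a truthy MagicMock, not None.

```python
from unittest.mock import MagicMock

response = MagicMock()
# Accessing an attribute that was never set auto-creates a child mock...
attr = response.charset_encoding
# ...which is truthy, so an `if response.charset_encoding:` guard passes
# even though no charset was ever configured on the mock.
assert not isinstance(attr, str)
assert bool(attr) is True
```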

tests/test_server.py

Using real httpx.Response objects in TestGetResponseText rather than mocks is exactly the right call — it exercises the actual encoding/decoding path with no stubbing. The _build_response helper is clean.

After the fix in 37da471, test_korean_euc_kr now properly calls .encode("euc-kr") and uses a sufficiently long sample for statistical detection. All five encoding scenarios are covered with realistic, semantically meaningful content.

pyproject.toml

requests is now an unused dependency

requests>=2.32.3 is listed in dependencies, but import requests does not appear anywhere in the server code. The PR description notes charset-normalizer was "already a transitive dep via requests" — now that it's a direct dependency, requests can be removed to reduce the dependency footprint.

Summary

| Area | Status |
| --- | --- |
| Core logic (get_response_text) | ✅ Correct |
| Encoding test coverage | ✅ Good (5 cases, real httpx objects) |
| euc-kr fix from prior review | ✅ Addressed |
| robots.txt consistency | ⚠️ Minor inconsistency |
| requests dependency | ⚠️ Appears unused, could be removed |
| TestFetchUrl mock fragility | 💬 Nit |

The two items worth addressing before merge are removing the unused requests dependency and the robots.txt inconsistency; everything else is a minor nit. Good work on the fix overall.
